1 Problem and Approach

This notebook addresses Home Credit's default-risk prediction problem, using the data from the corresponding Kaggle competition. The data and competition details are available at https://www.kaggle.com/competitions/home-credit-default-risk/code.

The data describe an individual’s characteristics: income, occupation, family size, etc. There are train and test files with similar variables; the train file additionally contains the target. There is also data about an individual’s transactions, balances, and other financial information.

Home Credit wants to give customers a good experience by more accurately predicting which applicants will be able to repay a loan and should be approved, and which will be unable to repay and should be rejected.

We will build a supervised classification model that predicts whether someone should be approved for a loan (1 meaning should not be approved, 0 meaning should be approved). We will use the Kaggle data about the individual, their transactions, and other financial information. We’ll explore the data, clean it, create visualizations, and perform feature engineering to help maximize the effectiveness of our model.

Questions

- What data is not needed and can be removed?
- Which variables are highly correlated with the target variable?
- Which occupations have the highest number of defaults?
- Which income types have the highest number of defaults?
- What is the target variable?
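Several of these questions come down to group-wise counts and simple correlations. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` in place of `app_train`. The column names match the data set, but every value is invented for illustration.

```r
# Toy stand-in for app_train (values are made up for illustration only)
toy <- data.frame(
  TARGET           = c(1, 0, 0, 1, 0, 1, 0, 0),
  OCCUPATION_TYPE  = c("Laborers", "Laborers", "Managers", "Drivers",
                       "Managers", "Laborers", "Drivers", "Managers"),
  AMT_INCOME_TOTAL = c(90000, 135000, 270000, 112500,
                       202500, 67500, 157500, 315000)
)

# Number of defaults (TARGET == 1) per occupation, highest first
defaults_by_occ <- sort(tapply(toy$TARGET, toy$OCCUPATION_TYPE, sum),
                        decreasing = TRUE)
print(defaults_by_occ)

# Default rate per occupation, which accounts for differing group sizes
rate_by_occ <- tapply(toy$TARGET, toy$OCCUPATION_TYPE, mean)
print(round(rate_by_occ, 2))

# Correlation of each numeric predictor with the 0/1 target.
# If TARGET is stored as a factor, convert it first with
# as.numeric(as.character(TARGET)).
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "TARGET")
cor_with_target <- sapply(
  toy[num_cols],
  function(x) cor(x, toy$TARGET, use = "pairwise.complete.obs")
)
print(cor_with_target)
```

Raw counts favor large groups, so comparing the default rate alongside the count is usually more informative. The same pattern applies to `NAME_INCOME_TYPE`.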

2 Data Prep

2.1 Load Packages

#Load packages
library(e1071)
library(psych)
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Loading required package: lattice
library(rminer)
library(rmarkdown)
library(tictoc) 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ purrr::lift()    masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(C50)
library(matrixStats)
## 
## Attaching package: 'matrixStats'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
library(knitr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(DataExplorer)

2.2 Import Data

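# Start a timer (tictoc); a later toc() reports the elapsed time
tic()
# Set working directory (here the current directory, where the CSV files are expected)
cloud_wd <- getwd()
setwd(cloud_wd)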

# Read the data into data frames and set strings as factors
app_test <- read.csv(file = "application_test.csv", stringsAsFactors = TRUE)
app_train <- read.csv(file = "application_train.csv", stringsAsFactors = TRUE)
bur_bal <- read.csv(file = "bureau_balance.csv", stringsAsFactors = TRUE)
bur <- read.csv(file = "bureau.csv", stringsAsFactors = TRUE)
cc_bal <- read.csv(file = "credit_card_balance.csv", stringsAsFactors = TRUE)
inst_pay <- read.csv(file = "installments_payments.csv", stringsAsFactors = TRUE)
pos <- read.csv(file = "POS_CASH_balance.csv", stringsAsFactors = TRUE)
pre_app <- read.csv(file = "previous_application.csv", stringsAsFactors = TRUE)

# Convert the target to a factor so models treat this as classification
app_train <- app_train %>% mutate(TARGET = factor(TARGET))
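With the target stored as a factor, it is worth checking how balanced the two classes are before modeling. The sketch below uses the class counts reported by `summary(app_train)` in the Data Summary section (282686 zeros and 24825 ones). The variable names are illustrative and are not part of the original notebook.

```r
# Class counts as reported by summary(app_train) in this notebook
target_counts <- c("0" = 282686, "1" = 24825)

# Share of each class in the training data
target_share <- target_counts / sum(target_counts)
print(round(target_share, 4))

# Accuracy of a naive model that always predicts the majority class (0);
# any useful classifier should be judged against this baseline
majority_baseline <- max(target_share)
print(round(majority_baseline, 4))
```

Only about 8% of training rows have TARGET = 1, so plain accuracy can look good for a model that never predicts a default. Metrics such as AUC or recall on the positive class are likely to be more informative here.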

2.3 Data Summary

# Train and test data structure
str(app_train)
## 'data.frame':    307511 obs. of  122 variables:
##  $ SK_ID_CURR                  : int  100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
##  $ TARGET                      : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ NAME_CONTRACT_TYPE          : Factor w/ 2 levels "Cash loans","Revolving loans": 1 1 2 1 1 1 1 1 1 2 ...
##  $ CODE_GENDER                 : Factor w/ 3 levels "F","M","XNA": 2 1 2 1 2 2 1 2 1 2 ...
##  $ FLAG_OWN_CAR                : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 2 1 1 ...
##  $ FLAG_OWN_REALTY             : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 2 2 2 ...
##  $ CNT_CHILDREN                : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ AMT_INCOME_TOTAL            : num  202500 270000 67500 135000 121500 ...
##  $ AMT_CREDIT                  : num  406598 1293502 135000 312682 513000 ...
##  $ AMT_ANNUITY                 : num  24700 35698 6750 29686 21866 ...
##  $ AMT_GOODS_PRICE             : num  351000 1129500 135000 297000 513000 ...
##  $ NAME_TYPE_SUITE             : Factor w/ 8 levels "","Children",..: 8 3 8 8 8 7 8 8 2 8 ...
##  $ NAME_INCOME_TYPE            : Factor w/ 8 levels "Businessman",..: 8 5 8 8 8 5 2 5 4 8 ...
##  $ NAME_EDUCATION_TYPE         : Factor w/ 5 levels "Academic degree",..: 5 2 5 5 5 5 2 2 5 5 ...
##  $ NAME_FAMILY_STATUS          : Factor w/ 6 levels "Civil marriage",..: 4 2 4 1 4 2 2 2 2 4 ...
##  $ NAME_HOUSING_TYPE           : Factor w/ 6 levels "Co-op apartment",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ REGION_POPULATION_RELATIVE  : num  0.0188 0.00354 0.01003 0.00802 0.02866 ...
##  $ DAYS_BIRTH                  : int  -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
##  $ DAYS_EMPLOYED               : int  -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
##  $ DAYS_REGISTRATION           : num  -3648 -1186 -4260 -9833 -4311 ...
##  $ DAYS_ID_PUBLISH             : int  -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
##  $ OWN_CAR_AGE                 : num  NA NA 26 NA NA NA 17 8 NA NA ...
##  $ FLAG_MOBIL                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_EMP_PHONE              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ FLAG_WORK_PHONE             : int  0 0 1 0 0 1 0 1 0 0 ...
##  $ FLAG_CONT_MOBILE            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_PHONE                  : int  1 1 1 0 0 1 1 0 0 0 ...
##  $ FLAG_EMAIL                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OCCUPATION_TYPE             : Factor w/ 19 levels "","Accountants",..: 10 5 10 10 5 10 2 12 1 10 ...
##  $ CNT_FAM_MEMBERS             : num  1 2 1 2 1 2 3 2 2 1 ...
##  $ REGION_RATING_CLIENT        : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ REGION_RATING_CLIENT_W_CITY : int  2 1 2 2 2 2 2 3 2 2 ...
##  $ WEEKDAY_APPR_PROCESS_START  : Factor w/ 7 levels "FRIDAY","MONDAY",..: 7 2 2 7 5 7 4 2 7 5 ...
##  $ HOUR_APPR_PROCESS_START     : int  10 11 9 17 11 16 16 16 14 8 ...
##  $ REG_REGION_NOT_LIVE_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_REGION_NOT_WORK_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LIVE_REGION_NOT_WORK_REGION : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_LIVE_CITY      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_WORK_CITY      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ LIVE_CITY_NOT_WORK_CITY     : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ORGANIZATION_TYPE           : Factor w/ 58 levels "Advertising",..: 6 40 12 6 38 34 6 34 58 10 ...
##  $ EXT_SOURCE_1                : num  0.083 0.311 NA NA NA ...
##  $ EXT_SOURCE_2                : num  0.263 0.622 0.556 0.65 0.323 ...
##  $ EXT_SOURCE_3                : num  0.139 NA 0.73 NA NA ...
##  $ APARTMENTS_AVG              : num  0.0247 0.0959 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_AVG            : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_AVG : num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_AVG             : num  0.619 0.796 NA NA NA ...
##  $ COMMONAREA_AVG              : num  0.0143 0.0605 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_AVG               : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_AVG               : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_AVG               : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_AVG               : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_AVG                : num  0.0369 0.013 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_AVG        : num  0.0202 0.0773 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_AVG              : num  0.019 0.0549 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_AVG     : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_AVG           : num  0 0.0098 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MODE             : num  0.0252 0.0924 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MODE           : num  0.0383 0.0538 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MODE: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MODE            : num  0.634 0.804 NA NA NA ...
##  $ COMMONAREA_MODE             : num  0.0144 0.0497 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MODE              : num  0 0.0806 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MODE              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MODE              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MODE              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MODE               : num  0.0377 0.0128 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MODE       : num  0.022 0.079 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MODE             : num  0.0198 0.0554 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MODE    : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MODE          : num  0 0 NA NA NA NA NA NA NA NA ...
##  $ APARTMENTS_MEDI             : num  0.025 0.0968 NA NA NA NA NA NA NA NA ...
##  $ BASEMENTAREA_MEDI           : num  0.0369 0.0529 NA NA NA NA NA NA NA NA ...
##  $ YEARS_BEGINEXPLUATATION_MEDI: num  0.972 0.985 NA NA NA ...
##  $ YEARS_BUILD_MEDI            : num  0.624 0.799 NA NA NA ...
##  $ COMMONAREA_MEDI             : num  0.0144 0.0608 NA NA NA NA NA NA NA NA ...
##  $ ELEVATORS_MEDI              : num  0 0.08 NA NA NA NA NA NA NA NA ...
##  $ ENTRANCES_MEDI              : num  0.069 0.0345 NA NA NA NA NA NA NA NA ...
##  $ FLOORSMAX_MEDI              : num  0.0833 0.2917 NA NA NA ...
##  $ FLOORSMIN_MEDI              : num  0.125 0.333 NA NA NA ...
##  $ LANDAREA_MEDI               : num  0.0375 0.0132 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAPARTMENTS_MEDI       : num  0.0205 0.0787 NA NA NA NA NA NA NA NA ...
##  $ LIVINGAREA_MEDI             : num  0.0193 0.0558 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAPARTMENTS_MEDI    : num  0 0.0039 NA NA NA NA NA NA NA NA ...
##  $ NONLIVINGAREA_MEDI          : num  0 0.01 NA NA NA NA NA NA NA NA ...
##  $ FONDKAPREMONT_MODE          : Factor w/ 5 levels "","not specified",..: 4 4 1 1 1 1 1 1 1 1 ...
##  $ HOUSETYPE_MODE              : Factor w/ 4 levels "","block of flats",..: 2 2 1 1 1 1 1 1 1 1 ...
##  $ TOTALAREA_MODE              : num  0.0149 0.0714 NA NA NA NA NA NA NA NA ...
##  $ WALLSMATERIAL_MODE          : Factor w/ 8 levels "","Block","Mixed",..: 7 2 1 1 1 1 1 1 1 1 ...
##  $ EMERGENCYSTATE_MODE         : Factor w/ 3 levels "","No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
##  $ OBS_30_CNT_SOCIAL_CIRCLE    : num  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_30_CNT_SOCIAL_CIRCLE    : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ OBS_60_CNT_SOCIAL_CIRCLE    : num  2 1 0 2 0 0 1 2 1 2 ...
##  $ DEF_60_CNT_SOCIAL_CIRCLE    : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ DAYS_LAST_PHONE_CHANGE      : num  -1134 -828 -815 -617 -1106 ...
##  $ FLAG_DOCUMENT_2             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAG_DOCUMENT_3             : int  1 1 0 1 0 1 0 1 1 0 ...
##  $ FLAG_DOCUMENT_4             : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
str(app_test)
## 'data.frame':    48744 obs. of  121 variables:
##  $ SK_ID_CURR                  : int  100001 100005 100013 100028 100038 100042 100057 100065 100066 100067 ...
##  $ NAME_CONTRACT_TYPE          : Factor w/ 2 levels "Cash loans","Revolving loans": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CODE_GENDER                 : Factor w/ 2 levels "F","M": 1 2 2 1 2 1 2 2 1 1 ...
##  $ FLAG_OWN_CAR                : Factor w/ 2 levels "N","Y": 1 1 2 1 2 2 2 1 1 2 ...
##  $ FLAG_OWN_REALTY             : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 2 ...
##  $ CNT_CHILDREN                : int  0 0 0 2 1 0 2 0 0 1 ...
##  $ AMT_INCOME_TOTAL            : num  135000 99000 202500 315000 180000 ...
##  $ AMT_CREDIT                  : num  568800 222768 663264 1575000 625500 ...
##  $ AMT_ANNUITY                 : num  20560 17370 69777 49018 32067 ...
##  $ AMT_GOODS_PRICE             : num  450000 180000 630000 1575000 625500 ...
##  $ NAME_TYPE_SUITE             : Factor w/ 8 levels "","Children",..: 8 8 1 8 8 8 8 8 8 3 ...
##  $ NAME_INCOME_TYPE            : Factor w/ 7 levels "Businessman",..: 7 7 7 7 7 4 7 7 4 7 ...
##  $ NAME_EDUCATION_TYPE         : Factor w/ 5 levels "Academic degree",..: 2 5 2 5 5 5 2 2 2 2 ...
##  $ NAME_FAMILY_STATUS          : Factor w/ 5 levels "Civil marriage",..: 2 2 2 2 2 2 2 4 2 1 ...
##  $ NAME_HOUSING_TYPE           : Factor w/ 6 levels "Co-op apartment",..: 2 2 2 2 2 2 2 6 2 2 ...
##  $ REGION_POPULATION_RELATIVE  : num  0.0188 0.0358 0.0191 0.0264 0.01 ...
##  $ DAYS_BIRTH                  : int  -19241 -18064 -20038 -13976 -13040 -18604 -16685 -9516 -12744 -10395 ...
##  $ DAYS_EMPLOYED               : int  -2329 -4469 -4458 -1866 -2191 -12009 -2580 -1387 -1013 -2625 ...
##  $ DAYS_REGISTRATION           : num  -5170 -9118 -2175 -2000 -4000 ...
##  $ DAYS_ID_PUBLISH             : int  -812 -1623 -3503 -4208 -4262 -2027 -241 -2055 -3171 -3041 ...
##  $ OWN_CAR_AGE                 : num  NA NA 5 NA 16 10 3 NA NA 5 ...
##  $ FLAG_MOBIL                  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_EMP_PHONE              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_WORK_PHONE             : int  0 0 0 0 1 0 0 1 0 1 ...
##  $ FLAG_CONT_MOBILE            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FLAG_PHONE                  : int  0 0 0 1 0 1 0 1 0 1 ...
##  $ FLAG_EMAIL                  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ OCCUPATION_TYPE             : Factor w/ 19 levels "","Accountants",..: 1 11 6 16 1 6 7 5 5 16 ...
##  $ CNT_FAM_MEMBERS             : num  2 2 2 4 3 2 4 1 2 3 ...
##  $ REGION_RATING_CLIENT        : int  2 2 2 2 2 2 2 2 1 2 ...
##  $ REGION_RATING_CLIENT_W_CITY : int  2 2 2 2 2 2 2 2 1 2 ...
##  $ WEEKDAY_APPR_PROCESS_START  : Factor w/ 7 levels "FRIDAY","MONDAY",..: 6 1 2 7 1 2 5 1 5 6 ...
##  $ HOUR_APPR_PROCESS_START     : int  18 9 14 11 5 15 9 7 18 14 ...
##  $ REG_REGION_NOT_LIVE_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_REGION_NOT_WORK_REGION  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LIVE_REGION_NOT_WORK_REGION : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_LIVE_CITY      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ REG_CITY_NOT_WORK_CITY      : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ LIVE_CITY_NOT_WORK_CITY     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ ORGANIZATION_TYPE           : Factor w/ 58 levels "Advertising",..: 29 43 55 6 6 12 27 43 40 47 ...
##  $ EXT_SOURCE_1                : num  0.753 0.565 NA 0.526 0.202 ...
##  $ EXT_SOURCE_2                : num  0.79 0.292 0.7 0.51 0.426 ...
##  $ EXT_SOURCE_3                : num  0.16 0.433 0.611 0.613 NA ...
##  $ APARTMENTS_AVG              : num  0.066 NA NA 0.305 NA ...
##  $ BASEMENTAREA_AVG            : num  0.059 NA NA 0.197 NA ...
##  $ YEARS_BEGINEXPLUATATION_AVG : num  0.973 NA NA 0.997 NA ...
##  $ YEARS_BUILD_AVG             : num  NA NA NA 0.959 NA ...
##  $ COMMONAREA_AVG              : num  NA NA NA 0.117 NA ...
##  $ ELEVATORS_AVG               : num  NA NA NA 0.32 NA 0.16 NA NA 0 NA ...
##  $ ENTRANCES_AVG               : num  0.138 NA NA 0.276 NA ...
##  $ FLOORSMAX_AVG               : num  0.125 NA NA 0.375 NA ...
##  $ FLOORSMIN_AVG               : num  NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
##  $ LANDAREA_AVG                : num  NA NA NA 0.204 NA ...
##  $ LIVINGAPARTMENTS_AVG        : num  NA NA NA 0.24 NA ...
##  $ LIVINGAREA_AVG              : num  0.0505 NA NA 0.3673 NA ...
##  $ NONLIVINGAPARTMENTS_AVG     : num  NA NA NA 0.0386 NA 0.0116 NA NA NA NA ...
##  $ NONLIVINGAREA_AVG           : num  NA NA NA 0.08 NA 0.0731 NA NA NA NA ...
##  $ APARTMENTS_MODE             : num  0.0672 NA NA 0.3109 NA ...
##  $ BASEMENTAREA_MODE           : num  0.0612 NA NA 0.2049 NA ...
##  $ YEARS_BEGINEXPLUATATION_MODE: num  0.973 NA NA 0.997 NA ...
##  $ YEARS_BUILD_MODE            : num  NA NA NA 0.961 NA ...
##  $ COMMONAREA_MODE             : num  NA NA NA 0.118 NA ...
##  $ ELEVATORS_MODE              : num  NA NA NA 0.322 NA ...
##  $ ENTRANCES_MODE              : num  0.138 NA NA 0.276 NA ...
##  $ FLOORSMAX_MODE              : num  0.125 NA NA 0.375 NA ...
##  $ FLOORSMIN_MODE              : num  NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
##  $ LANDAREA_MODE               : num  NA NA NA 0.209 NA ...
##  $ LIVINGAPARTMENTS_MODE       : num  NA NA NA 0.263 NA ...
##  $ LIVINGAREA_MODE             : num  0.0526 NA NA 0.3827 NA ...
##  $ NONLIVINGAPARTMENTS_MODE    : num  NA NA NA 0.0389 NA 0.0117 NA NA NA NA ...
##  $ NONLIVINGAREA_MODE          : num  NA NA NA 0.0847 NA 0.0774 NA NA NA NA ...
##  $ APARTMENTS_MEDI             : num  0.0666 NA NA 0.3081 NA ...
##  $ BASEMENTAREA_MEDI           : num  0.059 NA NA 0.197 NA ...
##  $ YEARS_BEGINEXPLUATATION_MEDI: num  0.973 NA NA 0.997 NA ...
##  $ YEARS_BUILD_MEDI            : num  NA NA NA 0.96 NA ...
##  $ COMMONAREA_MEDI             : num  NA NA NA 0.117 NA ...
##  $ ELEVATORS_MEDI              : num  NA NA NA 0.32 NA 0.16 NA NA 0 NA ...
##  $ ENTRANCES_MEDI              : num  0.138 NA NA 0.276 NA ...
##  $ FLOORSMAX_MEDI              : num  0.125 NA NA 0.375 NA ...
##  $ FLOORSMIN_MEDI              : num  NA NA NA 0.0417 NA 0.375 NA NA NA NA ...
##  $ LANDAREA_MEDI               : num  NA NA NA 0.208 NA ...
##  $ LIVINGAPARTMENTS_MEDI       : num  NA NA NA 0.245 NA ...
##  $ LIVINGAREA_MEDI             : num  0.0514 NA NA 0.3739 NA ...
##  $ NONLIVINGAPARTMENTS_MEDI    : num  NA NA NA 0.0388 NA 0.0116 NA NA NA NA ...
##  $ NONLIVINGAREA_MEDI          : num  NA NA NA 0.0817 NA 0.0746 NA NA NA NA ...
##  $ FONDKAPREMONT_MODE          : Factor w/ 5 levels "","not specified",..: 1 1 1 4 1 2 1 1 1 1 ...
##  $ HOUSETYPE_MODE              : Factor w/ 4 levels "","block of flats",..: 2 1 1 2 1 2 1 1 2 1 ...
##  $ TOTALAREA_MODE              : num  0.0392 NA NA 0.37 NA ...
##  $ WALLSMATERIAL_MODE          : Factor w/ 8 levels "","Block","Mixed",..: 7 1 1 6 1 2 1 1 7 1 ...
##  $ EMERGENCYSTATE_MODE         : Factor w/ 3 levels "","No","Yes": 2 1 1 2 1 2 1 1 2 1 ...
##  $ OBS_30_CNT_SOCIAL_CIRCLE    : num  0 0 0 0 0 0 1 0 0 4 ...
##  $ DEF_30_CNT_SOCIAL_CIRCLE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OBS_60_CNT_SOCIAL_CIRCLE    : num  0 0 0 0 0 0 1 0 0 4 ...
##  $ DEF_60_CNT_SOCIAL_CIRCLE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DAYS_LAST_PHONE_CHANGE      : num  -1740 0 -856 -1805 -821 ...
##  $ FLAG_DOCUMENT_2             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAG_DOCUMENT_3             : int  1 1 0 1 1 0 1 0 1 1 ...
##  $ FLAG_DOCUMENT_4             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FLAG_DOCUMENT_5             : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
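The two `str()` outputs show that some factors have different level sets in train and test: `CODE_GENDER` has 3 levels in train but 2 in test, `NAME_INCOME_TYPE` has 8 versus 7, and `NAME_FAMILY_STATUS` has 6 versus 5. Each file was read separately with `stringsAsFactors = TRUE`, so the integer codes behind those factors may not line up. One possible way to align them is sketched below on made-up toy data. The helper `align_levels` is hypothetical and is not part of the original notebook.

```r
# Toy stand-ins for app_train / app_test (made-up rows)
train <- data.frame(CODE_GENDER = factor(c("F", "M", "XNA", "F")))
test  <- data.frame(CODE_GENDER = factor(c("M", "F", "F")))

# Levels present in train but missing from test
print(setdiff(levels(train$CODE_GENDER), levels(test$CODE_GENDER)))

# Hypothetical helper: re-declare every shared factor column in `test`
# using the levels from `train`. Test values that never occur in train
# become NA and would need separate handling.
align_levels <- function(train, test) {
  shared <- intersect(names(train), names(test))
  for (col in shared) {
    if (is.factor(train[[col]]) && is.factor(test[[col]])) {
      test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
    }
  }
  test
}

test_aligned <- align_levels(train, test)
print(levels(test_aligned$CODE_GENDER))
```

An alternative is to drop the rare training rows, for example the 4 rows with `CODE_GENDER` equal to `XNA`. Which option is preferable depends on the model being fit.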
# Summarize the test and train data
summary(app_train)
##    SK_ID_CURR     TARGET           NAME_CONTRACT_TYPE CODE_GENDER  FLAG_OWN_CAR
##  Min.   :100002   0:282686   Cash loans     :278232   F  :202448   N:202924    
##  1st Qu.:189146   1: 24825   Revolving loans: 29279   M  :105059   Y:104587    
##  Median :278202                                       XNA:     4               
##  Mean   :278180                                                                
##  3rd Qu.:367142                                                                
##  Max.   :456255                                                                
##                                                                                
##  FLAG_OWN_REALTY  CNT_CHILDREN     AMT_INCOME_TOTAL      AMT_CREDIT     
##  N: 94199        Min.   : 0.0000   Min.   :    25650   Min.   :  45000  
##  Y:213312        1st Qu.: 0.0000   1st Qu.:   112500   1st Qu.: 270000  
##                  Median : 0.0000   Median :   147150   Median : 513531  
##                  Mean   : 0.4171   Mean   :   168798   Mean   : 599026  
##                  3rd Qu.: 1.0000   3rd Qu.:   202500   3rd Qu.: 808650  
##                  Max.   :19.0000   Max.   :117000000   Max.   :4050000  
##                                                                         
##   AMT_ANNUITY     AMT_GOODS_PRICE          NAME_TYPE_SUITE  
##  Min.   :  1616   Min.   :  40500   Unaccompanied  :248526  
##  1st Qu.: 16524   1st Qu.: 238500   Family         : 40149  
##  Median : 24903   Median : 450000   Spouse, partner: 11370  
##  Mean   : 27109   Mean   : 538396   Children       :  3267  
##  3rd Qu.: 34596   3rd Qu.: 679500   Other_B        :  1770  
##  Max.   :258026   Max.   :4050000                  :  1292  
##  NA's   :12       NA's   :278       (Other)        :  1137  
##              NAME_INCOME_TYPE                     NAME_EDUCATION_TYPE
##  Working             :158774   Academic degree              :   164  
##  Commercial associate: 71617   Higher education             : 74863  
##  Pensioner           : 55362   Incomplete higher            : 10277  
##  State servant       : 21703   Lower secondary              :  3816  
##  Unemployed          :    22   Secondary / secondary special:218391  
##  Student             :    18                                         
##  (Other)             :    15                                         
##             NAME_FAMILY_STATUS           NAME_HOUSING_TYPE 
##  Civil marriage      : 29775   Co-op apartment    :  1122  
##  Married             :196432   House / apartment  :272868  
##  Separated           : 19770   Municipal apartment: 11183  
##  Single / not married: 45444   Office apartment   :  2617  
##  Unknown             :     2   Rented apartment   :  4881  
##  Widow               : 16088   With parents       : 14840  
##                                                            
##  REGION_POPULATION_RELATIVE   DAYS_BIRTH     DAYS_EMPLOYED    DAYS_REGISTRATION
##  Min.   :0.00029            Min.   :-25229   Min.   :-17912   Min.   :-24672   
##  1st Qu.:0.01001            1st Qu.:-19682   1st Qu.: -2760   1st Qu.: -7480   
##  Median :0.01885            Median :-15750   Median : -1213   Median : -4504   
##  Mean   :0.02087            Mean   :-16037   Mean   : 63815   Mean   : -4986   
##  3rd Qu.:0.02866            3rd Qu.:-12413   3rd Qu.:  -289   3rd Qu.: -2010   
##  Max.   :0.07251            Max.   : -7489   Max.   :365243   Max.   :     0   
##                                                                                
##  DAYS_ID_PUBLISH  OWN_CAR_AGE       FLAG_MOBIL FLAG_EMP_PHONE  
##  Min.   :-7197   Min.   : 0.00    Min.   :0    Min.   :0.0000  
##  1st Qu.:-4299   1st Qu.: 5.00    1st Qu.:1    1st Qu.:1.0000  
##  Median :-3254   Median : 9.00    Median :1    Median :1.0000  
##  Mean   :-2994   Mean   :12.06    Mean   :1    Mean   :0.8199  
##  3rd Qu.:-1720   3rd Qu.:15.00    3rd Qu.:1    3rd Qu.:1.0000  
##  Max.   :    0   Max.   :91.00    Max.   :1    Max.   :1.0000  
##                  NA's   :202929                                
##  FLAG_WORK_PHONE  FLAG_CONT_MOBILE   FLAG_PHONE       FLAG_EMAIL     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.1994   Mean   :0.9981   Mean   :0.2811   Mean   :0.05672  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                      
##     OCCUPATION_TYPE  CNT_FAM_MEMBERS  REGION_RATING_CLIENT
##             :96391   Min.   : 1.000   Min.   :1.000       
##  Laborers   :55186   1st Qu.: 2.000   1st Qu.:2.000       
##  Sales staff:32102   Median : 2.000   Median :2.000       
##  Core staff :27570   Mean   : 2.153   Mean   :2.052       
##  Managers   :21371   3rd Qu.: 3.000   3rd Qu.:2.000       
##  Drivers    :18603   Max.   :20.000   Max.   :3.000       
##  (Other)    :56288   NA's   :2                            
##  REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START
##  Min.   :1.000               FRIDAY   :50338            Min.   : 0.00          
##  1st Qu.:2.000               MONDAY   :50714            1st Qu.:10.00          
##  Median :2.000               SATURDAY :33852            Median :12.00          
##  Mean   :2.032               SUNDAY   :16181            Mean   :12.06          
##  3rd Qu.:2.000               THURSDAY :50591            3rd Qu.:14.00          
##  Max.   :3.000               TUESDAY  :53901            Max.   :23.00          
##                              WEDNESDAY:51934                                   
##  REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION
##  Min.   :0.00000            Min.   :0.00000           
##  1st Qu.:0.00000            1st Qu.:0.00000           
##  Median :0.00000            Median :0.00000           
##  Mean   :0.01514            Mean   :0.05077           
##  3rd Qu.:0.00000            3rd Qu.:0.00000           
##  Max.   :1.00000            Max.   :1.00000           
##                                                       
##  LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY
##  Min.   :0.00000             Min.   :0.00000        Min.   :0.0000        
##  1st Qu.:0.00000             1st Qu.:0.00000        1st Qu.:0.0000        
##  Median :0.00000             Median :0.00000        Median :0.0000        
##  Mean   :0.04066             Mean   :0.07817        Mean   :0.2305        
##  3rd Qu.:0.00000             3rd Qu.:0.00000        3rd Qu.:0.0000        
##  Max.   :1.00000             Max.   :1.00000        Max.   :1.0000        
##                                                                           
##  LIVE_CITY_NOT_WORK_CITY              ORGANIZATION_TYPE   EXT_SOURCE_1   
##  Min.   :0.0000          Business Entity Type 3: 67992   Min.   :0.01    
##  1st Qu.:0.0000          XNA                   : 55374   1st Qu.:0.33    
##  Median :0.0000          Self-employed         : 38412   Median :0.51    
##  Mean   :0.1796          Other                 : 16683   Mean   :0.50    
##  3rd Qu.:0.0000          Medicine              : 11193   3rd Qu.:0.68    
##  Max.   :1.0000          Business Entity Type 2: 10553   Max.   :0.96    
##                          (Other)               :107304   NA's   :173378  
##   EXT_SOURCE_2     EXT_SOURCE_3   APARTMENTS_AVG   BASEMENTAREA_AVG
##  Min.   :0.0000   Min.   :0.00    Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.3925   1st Qu.:0.37    1st Qu.:0.06     1st Qu.:0.04    
##  Median :0.5660   Median :0.54    Median :0.09     Median :0.08    
##  Mean   :0.5144   Mean   :0.51    Mean   :0.12     Mean   :0.09    
##  3rd Qu.:0.6636   3rd Qu.:0.67    3rd Qu.:0.15     3rd Qu.:0.11    
##  Max.   :0.8550   Max.   :0.90    Max.   :1.00     Max.   :1.00    
##  NA's   :660      NA's   :60965   NA's   :156061   NA's   :179943  
##  YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG  COMMONAREA_AVG   ELEVATORS_AVG   
##  Min.   :0.00                Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.98                1st Qu.:0.69     1st Qu.:0.01     1st Qu.:0.00    
##  Median :0.98                Median :0.76     Median :0.02     Median :0.00    
##  Mean   :0.98                Mean   :0.75     Mean   :0.04     Mean   :0.08    
##  3rd Qu.:0.99                3rd Qu.:0.82     3rd Qu.:0.05     3rd Qu.:0.12    
##  Max.   :1.00                Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :150007              NA's   :204488   NA's   :214865   NA's   :163891  
##  ENTRANCES_AVG    FLOORSMAX_AVG    FLOORSMIN_AVG     LANDAREA_AVG   
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08     1st Qu.:0.02    
##  Median :0.14     Median :0.17     Median :0.21     Median :0.05    
##  Mean   :0.15     Mean   :0.23     Mean   :0.23     Mean   :0.07    
##  3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38     3rd Qu.:0.09    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :154828   NA's   :153020   NA's   :208642   NA's   :182590  
##  LIVINGAPARTMENTS_AVG LIVINGAREA_AVG   NONLIVINGAPARTMENTS_AVG
##  Min.   :0.00         Min.   :0.00     Min.   :0.00           
##  1st Qu.:0.05         1st Qu.:0.05     1st Qu.:0.00           
##  Median :0.08         Median :0.07     Median :0.00           
##  Mean   :0.10         Mean   :0.11     Mean   :0.01           
##  3rd Qu.:0.12         3rd Qu.:0.13     3rd Qu.:0.00           
##  Max.   :1.00         Max.   :1.00     Max.   :1.00           
##  NA's   :210199       NA's   :154350   NA's   :213514         
##  NONLIVINGAREA_AVG APARTMENTS_MODE  BASEMENTAREA_MODE
##  Min.   :0.00      Min.   :0.00     Min.   :0.00     
##  1st Qu.:0.00      1st Qu.:0.05     1st Qu.:0.04     
##  Median :0.00      Median :0.08     Median :0.07     
##  Mean   :0.03      Mean   :0.11     Mean   :0.09     
##  3rd Qu.:0.03      3rd Qu.:0.14     3rd Qu.:0.11     
##  Max.   :1.00      Max.   :1.00     Max.   :1.00     
##  NA's   :169682    NA's   :156061   NA's   :179943   
##  YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE 
##  Min.   :0.00                 Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.98                 1st Qu.:0.70     1st Qu.:0.01    
##  Median :0.98                 Median :0.76     Median :0.02    
##  Mean   :0.98                 Mean   :0.76     Mean   :0.04    
##  3rd Qu.:0.99                 3rd Qu.:0.82     3rd Qu.:0.05    
##  Max.   :1.00                 Max.   :1.00     Max.   :1.00    
##  NA's   :150007               NA's   :204488   NA's   :214865  
##  ELEVATORS_MODE   ENTRANCES_MODE   FLOORSMAX_MODE   FLOORSMIN_MODE  
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
##  Median :0.00     Median :0.14     Median :0.17     Median :0.21    
##  Mean   :0.07     Mean   :0.15     Mean   :0.22     Mean   :0.23    
##  3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
##  LANDAREA_MODE    LIVINGAPARTMENTS_MODE LIVINGAREA_MODE 
##  Min.   :0.00     Min.   :0.00          Min.   :0.00    
##  1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.04    
##  Median :0.05     Median :0.08          Median :0.07    
##  Mean   :0.06     Mean   :0.11          Mean   :0.11    
##  3rd Qu.:0.08     3rd Qu.:0.13          3rd Qu.:0.13    
##  Max.   :1.00     Max.   :1.00          Max.   :1.00    
##  NA's   :182590   NA's   :210199        NA's   :154350  
##  NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI  BASEMENTAREA_MEDI
##  Min.   :0.00             Min.   :0.00       Min.   :0.00     Min.   :0.00     
##  1st Qu.:0.00             1st Qu.:0.00       1st Qu.:0.06     1st Qu.:0.04     
##  Median :0.00             Median :0.00       Median :0.09     Median :0.08     
##  Mean   :0.01             Mean   :0.03       Mean   :0.12     Mean   :0.09     
##  3rd Qu.:0.00             3rd Qu.:0.02       3rd Qu.:0.15     3rd Qu.:0.11     
##  Max.   :1.00             Max.   :1.00       Max.   :1.00     Max.   :1.00     
##  NA's   :213514           NA's   :169682     NA's   :156061   NA's   :179943   
##  YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI 
##  Min.   :0.00                 Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.98                 1st Qu.:0.69     1st Qu.:0.01    
##  Median :0.98                 Median :0.76     Median :0.02    
##  Mean   :0.98                 Mean   :0.76     Mean   :0.04    
##  3rd Qu.:0.99                 3rd Qu.:0.83     3rd Qu.:0.05    
##  Max.   :1.00                 Max.   :1.00     Max.   :1.00    
##  NA's   :150007               NA's   :204488   NA's   :214865  
##  ELEVATORS_MEDI   ENTRANCES_MEDI   FLOORSMAX_MEDI   FLOORSMIN_MEDI  
##  Min.   :0.00     Min.   :0.00     Min.   :0.00     Min.   :0.00    
##  1st Qu.:0.00     1st Qu.:0.07     1st Qu.:0.17     1st Qu.:0.08    
##  Median :0.00     Median :0.14     Median :0.17     Median :0.21    
##  Mean   :0.08     Mean   :0.15     Mean   :0.23     Mean   :0.23    
##  3rd Qu.:0.12     3rd Qu.:0.21     3rd Qu.:0.33     3rd Qu.:0.38    
##  Max.   :1.00     Max.   :1.00     Max.   :1.00     Max.   :1.00    
##  NA's   :163891   NA's   :154828   NA's   :153020   NA's   :208642  
##  LANDAREA_MEDI    LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI 
##  Min.   :0.00     Min.   :0.00          Min.   :0.00    
##  1st Qu.:0.02     1st Qu.:0.05          1st Qu.:0.05    
##  Median :0.05     Median :0.08          Median :0.07    
##  Mean   :0.07     Mean   :0.10          Mean   :0.11    
##  3rd Qu.:0.09     3rd Qu.:0.12          3rd Qu.:0.13    
##  Max.   :1.00     Max.   :1.00          Max.   :1.00    
##  NA's   :182590   NA's   :210199        NA's   :154350  
##  NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI             FONDKAPREMONT_MODE
##  Min.   :0.00             Min.   :0.00                            :210295  
##  1st Qu.:0.00             1st Qu.:0.00       not specified        :  5687  
##  Median :0.00             Median :0.00       org spec account     :  5619  
##  Mean   :0.01             Mean   :0.03       reg oper account     : 73830  
##  3rd Qu.:0.00             3rd Qu.:0.03       reg oper spec account: 12080  
##  Max.   :1.00             Max.   :1.00                                     
##  NA's   :213514           NA's   :169682                                   
##           HOUSETYPE_MODE   TOTALAREA_MODE      WALLSMATERIAL_MODE
##                  :154297   Min.   :0.00                 :156341  
##  block of flats  :150503   1st Qu.:0.04     Panel       : 66040  
##  specific housing:  1499   Median :0.07     Stone, brick: 64815  
##  terraced house  :  1212   Mean   :0.10     Block       :  9253  
##                            3rd Qu.:0.13     Wooden      :  5362  
##                            Max.   :1.00     Mixed       :  2296  
##                            NA's   :148431   (Other)     :  3404  
##  EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE
##     :145755          Min.   :  0.000          Min.   : 0.0000         
##  No :159428          1st Qu.:  0.000          1st Qu.: 0.0000         
##  Yes:  2328          Median :  0.000          Median : 0.0000         
##                      Mean   :  1.422          Mean   : 0.1434         
##                      3rd Qu.:  2.000          3rd Qu.: 0.0000         
##                      Max.   :348.000          Max.   :34.0000         
##                      NA's   :1021             NA's   :1021            
##  OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE
##  Min.   :  0.000          Min.   : 0.0             Min.   :-4292.0       
##  1st Qu.:  0.000          1st Qu.: 0.0             1st Qu.:-1570.0       
##  Median :  0.000          Median : 0.0             Median : -757.0       
##  Mean   :  1.405          Mean   : 0.1             Mean   : -962.9       
##  3rd Qu.:  2.000          3rd Qu.: 0.0             3rd Qu.: -274.0       
##  Max.   :344.000          Max.   :24.0             Max.   :    0.0       
##  NA's   :1021             NA's   :1021             NA's   :1             
##  FLAG_DOCUMENT_2    FLAG_DOCUMENT_3 FLAG_DOCUMENT_4    FLAG_DOCUMENT_5  
##  Min.   :0.00e+00   Min.   :0.00    Min.   :0.00e+00   Min.   :0.00000  
##  1st Qu.:0.00e+00   1st Qu.:0.00    1st Qu.:0.00e+00   1st Qu.:0.00000  
##  Median :0.00e+00   Median :1.00    Median :0.00e+00   Median :0.00000  
##  Mean   :4.23e-05   Mean   :0.71    Mean   :8.13e-05   Mean   :0.01511  
##  3rd Qu.:0.00e+00   3rd Qu.:1.00    3rd Qu.:0.00e+00   3rd Qu.:0.00000  
##  Max.   :1.00e+00   Max.   :1.00    Max.   :1.00e+00   Max.   :1.00000  
##                                                                         
##  FLAG_DOCUMENT_6   FLAG_DOCUMENT_7     FLAG_DOCUMENT_8   FLAG_DOCUMENT_9   
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.0000000   Median :0.00000   Median :0.000000  
##  Mean   :0.08806   Mean   :0.0001919   Mean   :0.08138   Mean   :0.003896  
##  3rd Qu.:0.00000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000  
##                                                                            
##  FLAG_DOCUMENT_10   FLAG_DOCUMENT_11   FLAG_DOCUMENT_12  FLAG_DOCUMENT_13  
##  Min.   :0.00e+00   Min.   :0.000000   Min.   :0.0e+00   Min.   :0.000000  
##  1st Qu.:0.00e+00   1st Qu.:0.000000   1st Qu.:0.0e+00   1st Qu.:0.000000  
##  Median :0.00e+00   Median :0.000000   Median :0.0e+00   Median :0.000000  
##  Mean   :2.28e-05   Mean   :0.003912   Mean   :6.5e-06   Mean   :0.003525  
##  3rd Qu.:0.00e+00   3rd Qu.:0.000000   3rd Qu.:0.0e+00   3rd Qu.:0.000000  
##  Max.   :1.00e+00   Max.   :1.000000   Max.   :1.0e+00   Max.   :1.000000  
##                                                                            
##  FLAG_DOCUMENT_14   FLAG_DOCUMENT_15  FLAG_DOCUMENT_16   FLAG_DOCUMENT_17   
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.000000   Median :0.00000   Median :0.000000   Median :0.0000000  
##  Mean   :0.002936   Mean   :0.00121   Mean   :0.009928   Mean   :0.0002667  
##  3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.0000000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000  
##                                                                             
##  FLAG_DOCUMENT_18  FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21   
##  Min.   :0.00000   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.0000000   Median :0.0000000   Median :0.0000000  
##  Mean   :0.00813   Mean   :0.0005951   Mean   :0.0005073   Mean   :0.0003349  
##  3rd Qu.:0.00000   3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
##  Max.   :1.00000   Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000  
##                                                                               
##  AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY
##  Min.   :0.00               Min.   :0.00             
##  1st Qu.:0.00               1st Qu.:0.00             
##  Median :0.00               Median :0.00             
##  Mean   :0.01               Mean   :0.01             
##  3rd Qu.:0.00               3rd Qu.:0.00             
##  Max.   :4.00               Max.   :9.00             
##  NA's   :41519              NA's   :41519            
##  AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT
##  Min.   :0.00               Min.   : 0.00             Min.   :  0.00           
##  1st Qu.:0.00               1st Qu.: 0.00             1st Qu.:  0.00           
##  Median :0.00               Median : 0.00             Median :  0.00           
##  Mean   :0.03               Mean   : 0.27             Mean   :  0.27           
##  3rd Qu.:0.00               3rd Qu.: 0.00             3rd Qu.:  0.00           
##  Max.   :8.00               Max.   :27.00             Max.   :261.00           
##  NA's   :41519              NA's   :41519             NA's   :41519            
##  AMT_REQ_CREDIT_BUREAU_YEAR
##  Min.   : 0.0              
##  1st Qu.: 0.0              
##  Median : 1.0              
##  Mean   : 1.9              
##  3rd Qu.: 3.0              
##  Max.   :25.0              
##  NA's   :41519
summary(app_test)
##    SK_ID_CURR           NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR
##  Min.   :100001   Cash loans     :48305    F:32678     N:32311     
##  1st Qu.:188558   Revolving loans:  439    M:16066     Y:16433     
##  Median :277549                                                    
##  Mean   :277797                                                    
##  3rd Qu.:367556                                                    
##  Max.   :456250                                                    
##                                                                    
##  FLAG_OWN_REALTY  CNT_CHILDREN     AMT_INCOME_TOTAL    AMT_CREDIT     
##  N:15086         Min.   : 0.0000   Min.   :  26942   Min.   :  45000  
##  Y:33658         1st Qu.: 0.0000   1st Qu.: 112500   1st Qu.: 260640  
##                  Median : 0.0000   Median : 157500   Median : 450000  
##                  Mean   : 0.3971   Mean   : 178432   Mean   : 516740  
##                  3rd Qu.: 1.0000   3rd Qu.: 225000   3rd Qu.: 675000  
##                  Max.   :20.0000   Max.   :4410000   Max.   :2245500  
##                                                                       
##   AMT_ANNUITY     AMT_GOODS_PRICE          NAME_TYPE_SUITE 
##  Min.   :  2295   Min.   :  45000   Unaccompanied  :39727  
##  1st Qu.: 17973   1st Qu.: 225000   Family         : 5881  
##  Median : 26199   Median : 396000   Spouse, partner: 1448  
##  Mean   : 29426   Mean   : 462619                  :  911  
##  3rd Qu.: 37390   3rd Qu.: 630000   Children       :  408  
##  Max.   :180576   Max.   :2245500   Other_B        :  211  
##  NA's   :24                         (Other)        :  158  
##              NAME_INCOME_TYPE                    NAME_EDUCATION_TYPE
##  Businessman         :    1   Academic degree              :   41   
##  Commercial associate:11402   Higher education             :12516   
##  Pensioner           : 9273   Incomplete higher            : 1724   
##  State servant       : 3532   Lower secondary              :  475   
##  Student             :    2   Secondary / secondary special:33988   
##  Unemployed          :    1                                         
##  Working             :24533                                         
##             NAME_FAMILY_STATUS           NAME_HOUSING_TYPE
##  Civil marriage      : 4261    Co-op apartment    :  123  
##  Married             :32283    House / apartment  :43645  
##  Separated           : 2955    Municipal apartment: 1617  
##  Single / not married: 7036    Office apartment   :  407  
##  Widow               : 2209    Rented apartment   :  718  
##                                With parents       : 2234  
##                                                           
##  REGION_POPULATION_RELATIVE   DAYS_BIRTH     DAYS_EMPLOYED    DAYS_REGISTRATION
##  Min.   :0.000253           Min.   :-25195   Min.   :-17463   Min.   :-23722   
##  1st Qu.:0.010006           1st Qu.:-19637   1st Qu.: -2910   1st Qu.: -7459   
##  Median :0.018850           Median :-15785   Median : -1293   Median : -4490   
##  Mean   :0.021226           Mean   :-16068   Mean   : 67485   Mean   : -4968   
##  3rd Qu.:0.028663           3rd Qu.:-12496   3rd Qu.:  -296   3rd Qu.: -1901   
##  Max.   :0.072508           Max.   : -7338   Max.   :365243   Max.   :     0   
##                                                                                
##  DAYS_ID_PUBLISH  OWN_CAR_AGE      FLAG_MOBIL FLAG_EMP_PHONE   FLAG_WORK_PHONE 
##  Min.   :-6348   Min.   : 0.00   Min.   :0    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:-4448   1st Qu.: 4.00   1st Qu.:1    1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :-3234   Median : 9.00   Median :1    Median :1.0000   Median :0.0000  
##  Mean   :-3052   Mean   :11.79   Mean   :1    Mean   :0.8097   Mean   :0.2047  
##  3rd Qu.:-1706   3rd Qu.:15.00   3rd Qu.:1    3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :    0   Max.   :74.00   Max.   :1    Max.   :1.0000   Max.   :1.0000  
##                  NA's   :32312                                                 
##  FLAG_CONT_MOBILE   FLAG_PHONE       FLAG_EMAIL        OCCUPATION_TYPE 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000              :15605  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   Laborers   : 8655  
##  Median :1.0000   Median :0.0000   Median :0.0000   Sales staff: 5072  
##  Mean   :0.9984   Mean   :0.2631   Mean   :0.1626   Core staff : 4361  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   Managers   : 3574  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Drivers    : 2773  
##                                                     (Other)    : 8704  
##  CNT_FAM_MEMBERS  REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY
##  Min.   : 1.000   Min.   :1.000        Min.   :-1.000             
##  1st Qu.: 2.000   1st Qu.:2.000        1st Qu.: 2.000             
##  Median : 2.000   Median :2.000        Median : 2.000             
##  Mean   : 2.147   Mean   :2.038        Mean   : 2.013             
##  3rd Qu.: 3.000   3rd Qu.:2.000        3rd Qu.: 2.000             
##  Max.   :21.000   Max.   :3.000        Max.   : 3.000             
##                                                                   
##  WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION
##  FRIDAY   :7250             Min.   : 0.00           Min.   :0.00000           
##  MONDAY   :8406             1st Qu.:10.00           1st Qu.:0.00000           
##  SATURDAY :4603             Median :12.00           Median :0.00000           
##  SUNDAY   :1859             Mean   :12.01           Mean   :0.01883           
##  THURSDAY :8418             3rd Qu.:14.00           3rd Qu.:0.00000           
##  TUESDAY  :9751             Max.   :23.00           Max.   :1.00000           
##  WEDNESDAY:8457                                                               
##  REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY
##  Min.   :0.00000            Min.   :0.00000             Min.   :0.00000       
##  1st Qu.:0.00000            1st Qu.:0.00000             1st Qu.:0.00000       
##  Median :0.00000            Median :0.00000             Median :0.00000       
##  Mean   :0.05517            Mean   :0.04204             Mean   :0.07747       
##  3rd Qu.:0.00000            3rd Qu.:0.00000             3rd Qu.:0.00000       
##  Max.   :1.00000            Max.   :1.00000             Max.   :1.00000       
##                                                                               
##  REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY              ORGANIZATION_TYPE
##  Min.   :0.0000         Min.   :0.0000          Business Entity Type 3:10840  
##  1st Qu.:0.0000         1st Qu.:0.0000          XNA                   : 9274  
##  Median :0.0000         Median :0.0000          Self-employed         : 5920  
##  Mean   :0.2247         Mean   :0.1742          Other                 : 2707  
##  3rd Qu.:0.0000         3rd Qu.:0.0000          Medicine              : 1716  
##  Max.   :1.0000         Max.   :1.0000          Government            : 1508  
##                                                 (Other)               :16779  
##   EXT_SOURCE_1    EXT_SOURCE_2       EXT_SOURCE_3   APARTMENTS_AVG 
##  Min.   :0.013   Min.   :0.000008   Min.   :0.001   Min.   :0.000  
##  1st Qu.:0.344   1st Qu.:0.408066   1st Qu.:0.364   1st Qu.:0.062  
##  Median :0.507   Median :0.558758   Median :0.519   Median :0.093  
##  Mean   :0.501   Mean   :0.518021   Mean   :0.500   Mean   :0.122  
##  3rd Qu.:0.666   3rd Qu.:0.658497   3rd Qu.:0.653   3rd Qu.:0.148  
##  Max.   :0.939   Max.   :0.855000   Max.   :0.883   Max.   :1.000  
##  NA's   :20532   NA's   :8          NA's   :8668    NA's   :23887  
##  BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG 
##  Min.   :0.000    Min.   :0.000               Min.   :0.00    Min.   :0.00   
##  1st Qu.:0.047    1st Qu.:0.977               1st Qu.:0.69    1st Qu.:0.01   
##  Median :0.078    Median :0.982               Median :0.76    Median :0.02   
##  Mean   :0.090    Mean   :0.979               Mean   :0.75    Mean   :0.05   
##  3rd Qu.:0.113    3rd Qu.:0.987               3rd Qu.:0.82    3rd Qu.:0.05   
##  Max.   :1.000    Max.   :1.000               Max.   :1.00    Max.   :1.00   
##  NA's   :27641    NA's   :22856               NA's   :31818   NA's   :33495  
##  ELEVATORS_AVG   ENTRANCES_AVG   FLOORSMAX_AVG   FLOORSMIN_AVG  
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00   
##  1st Qu.:0.000   1st Qu.:0.074   1st Qu.:0.167   1st Qu.:0.10   
##  Median :0.000   Median :0.138   Median :0.167   Median :0.21   
##  Mean   :0.085   Mean   :0.152   Mean   :0.234   Mean   :0.24   
##  3rd Qu.:0.160   3rd Qu.:0.207   3rd Qu.:0.333   3rd Qu.:0.38   
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.00   
##  NA's   :25189   NA's   :23579   NA's   :23321   NA's   :32466  
##   LANDAREA_AVG   LIVINGAPARTMENTS_AVG LIVINGAREA_AVG  NONLIVINGAPARTMENTS_AVG
##  Min.   :0.000   Min.   :0.00         Min.   :0.000   Min.   :0.00           
##  1st Qu.:0.019   1st Qu.:0.05         1st Qu.:0.049   1st Qu.:0.00           
##  Median :0.048   Median :0.08         Median :0.077   Median :0.00           
##  Mean   :0.067   Mean   :0.11         Mean   :0.112   Mean   :0.01           
##  3rd Qu.:0.087   3rd Qu.:0.13         3rd Qu.:0.138   3rd Qu.:0.01           
##  Max.   :1.000   Max.   :1.00         Max.   :1.000   Max.   :1.00           
##  NA's   :28254   NA's   :32780        NA's   :23552   NA's   :33347          
##  NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE
##  Min.   :0.000     Min.   :0.000   Min.   :0.000    
##  1st Qu.:0.000     1st Qu.:0.059   1st Qu.:0.043    
##  Median :0.004     Median :0.085   Median :0.077    
##  Mean   :0.029     Mean   :0.119   Mean   :0.089    
##  3rd Qu.:0.029     3rd Qu.:0.150   3rd Qu.:0.114    
##  Max.   :1.000     Max.   :1.000   Max.   :1.000    
##  NA's   :26084     NA's   :23887   NA's   :27641    
##  YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE 
##  Min.   :0.000                Min.   :0.00     Min.   :0.00    Min.   :0.000  
##  1st Qu.:0.976                1st Qu.:0.69     1st Qu.:0.01    1st Qu.:0.000  
##  Median :0.982                Median :0.76     Median :0.02    Median :0.000  
##  Mean   :0.978                Mean   :0.76     Mean   :0.05    Mean   :0.081  
##  3rd Qu.:0.987                3rd Qu.:0.82     3rd Qu.:0.05    3rd Qu.:0.121  
##  Max.   :1.000                Max.   :1.00     Max.   :1.00    Max.   :1.000  
##  NA's   :22856                NA's   :31818    NA's   :33495   NA's   :25189  
##  ENTRANCES_MODE  FLOORSMAX_MODE  FLOORSMIN_MODE  LANDAREA_MODE  
##  Min.   :0.000   Min.   :0.000   Min.   :0.00    Min.   :0.000  
##  1st Qu.:0.069   1st Qu.:0.167   1st Qu.:0.08    1st Qu.:0.017  
##  Median :0.138   Median :0.167   Median :0.21    Median :0.046  
##  Mean   :0.147   Mean   :0.229   Mean   :0.23    Mean   :0.066  
##  3rd Qu.:0.207   3rd Qu.:0.333   3rd Qu.:0.38    3rd Qu.:0.086  
##  Max.   :1.000   Max.   :1.000   Max.   :1.00    Max.   :1.000  
##  NA's   :23579   NA's   :23321   NA's   :32466   NA's   :28254  
##  LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE
##  Min.   :0.00          Min.   :0.000   Min.   :0.00            
##  1st Qu.:0.06          1st Qu.:0.046   1st Qu.:0.00            
##  Median :0.08          Median :0.075   Median :0.00            
##  Mean   :0.11          Mean   :0.111   Mean   :0.01            
##  3rd Qu.:0.13          3rd Qu.:0.131   3rd Qu.:0.00            
##  Max.   :1.00          Max.   :1.000   Max.   :1.00            
##  NA's   :32780         NA's   :23552   NA's   :33347           
##  NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI
##  Min.   :0.000      Min.   :0.000   Min.   :0.000    
##  1st Qu.:0.000      1st Qu.:0.062   1st Qu.:0.046    
##  Median :0.001      Median :0.093   Median :0.078    
##  Mean   :0.028      Mean   :0.123   Mean   :0.090    
##  3rd Qu.:0.024      3rd Qu.:0.150   3rd Qu.:0.113    
##  Max.   :1.000      Max.   :1.000   Max.   :1.000    
##  NA's   :26084      NA's   :23887   NA's   :27641    
##  YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI 
##  Min.   :0.000                Min.   :0.00     Min.   :0.00    Min.   :0.000  
##  1st Qu.:0.977                1st Qu.:0.69     1st Qu.:0.01    1st Qu.:0.000  
##  Median :0.982                Median :0.76     Median :0.02    Median :0.000  
##  Mean   :0.979                Mean   :0.75     Mean   :0.05    Mean   :0.084  
##  3rd Qu.:0.987                3rd Qu.:0.82     3rd Qu.:0.05    3rd Qu.:0.160  
##  Max.   :1.000                Max.   :1.00     Max.   :1.00    Max.   :1.000  
##  NA's   :22856                NA's   :31818    NA's   :33495   NA's   :25189  
##  ENTRANCES_MEDI  FLOORSMAX_MEDI  FLOORSMIN_MEDI  LANDAREA_MEDI  
##  Min.   :0.000   Min.   :0.000   Min.   :0.00    Min.   :0.000  
##  1st Qu.:0.069   1st Qu.:0.167   1st Qu.:0.08    1st Qu.:0.019  
##  Median :0.138   Median :0.167   Median :0.21    Median :0.049  
##  Mean   :0.151   Mean   :0.233   Mean   :0.24    Mean   :0.068  
##  3rd Qu.:0.207   3rd Qu.:0.333   3rd Qu.:0.38    3rd Qu.:0.088  
##  Max.   :1.000   Max.   :1.000   Max.   :1.00    Max.   :1.000  
##  NA's   :23579   NA's   :23321   NA's   :32466   NA's   :28254  
##  LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI
##  Min.   :0.00          Min.   :0.000   Min.   :0.00            
##  1st Qu.:0.05          1st Qu.:0.049   1st Qu.:0.00            
##  Median :0.08          Median :0.078   Median :0.00            
##  Mean   :0.11          Mean   :0.113   Mean   :0.01            
##  3rd Qu.:0.13          3rd Qu.:0.137   3rd Qu.:0.00            
##  Max.   :1.00          Max.   :1.000   Max.   :1.00            
##  NA's   :32780         NA's   :23552   NA's   :33347           
##  NONLIVINGAREA_MEDI             FONDKAPREMONT_MODE          HOUSETYPE_MODE 
##  Min.   :0.000                           :32797                    :23619  
##  1st Qu.:0.000      not specified        :  913    block of flats  :24659  
##  Median :0.003      org spec account     :  920    specific housing:  262  
##  Mean   :0.029      reg oper account     :12124    terraced house  :  204  
##  3rd Qu.:0.028      reg oper spec account: 1990                            
##  Max.   :1.000                                                             
##  NA's   :26084                                                             
##  TOTALAREA_MODE     WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
##  Min.   :0.000               :23893       :22209          
##  1st Qu.:0.043   Panel       :11269    No :26179          
##  Median :0.071   Stone, brick:10434    Yes:  356          
##  Mean   :0.107   Block       : 1428                       
##  3rd Qu.:0.136   Wooden      :  794                       
##  Max.   :1.000   Mixed       :  353                       
##  NA's   :22624   (Other)     :  573                       
##  OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE
##  Min.   :  0.000          Min.   : 0.0000          Min.   :  0.000         
##  1st Qu.:  0.000          1st Qu.: 0.0000          1st Qu.:  0.000         
##  Median :  0.000          Median : 0.0000          Median :  0.000         
##  Mean   :  1.448          Mean   : 0.1436          Mean   :  1.436         
##  3rd Qu.:  2.000          3rd Qu.: 0.0000          3rd Qu.:  2.000         
##  Max.   :354.000          Max.   :34.0000          Max.   :351.000         
##  NA's   :29               NA's   :29               NA's   :29              
##  DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2
##  Min.   : 0.0000          Min.   :-4361          Min.   :0      
##  1st Qu.: 0.0000          1st Qu.:-1766          1st Qu.:0      
##  Median : 0.0000          Median : -863          Median :0      
##  Mean   : 0.1011          Mean   :-1078          Mean   :0      
##  3rd Qu.: 0.0000          3rd Qu.: -363          3rd Qu.:0      
##  Max.   :24.0000          Max.   :    0          Max.   :0      
##  NA's   :29                                                     
##  FLAG_DOCUMENT_3  FLAG_DOCUMENT_4     FLAG_DOCUMENT_5   FLAG_DOCUMENT_6  
##  Min.   :0.0000   Min.   :0.0000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:1.0000   1st Qu.:0.0000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.0000   Median :0.0000000   Median :0.00000   Median :0.00000  
##  Mean   :0.7866   Mean   :0.0001026   Mean   :0.01475   Mean   :0.08748  
##  3rd Qu.:1.0000   3rd Qu.:0.0000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000000   Max.   :1.00000   Max.   :1.00000  
##                                                                          
##  FLAG_DOCUMENT_7   FLAG_DOCUMENT_8   FLAG_DOCUMENT_9    FLAG_DOCUMENT_10
##  Min.   :0.0e+00   Min.   :0.00000   Min.   :0.000000   Min.   :0       
##  1st Qu.:0.0e+00   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0       
##  Median :0.0e+00   Median :0.00000   Median :0.000000   Median :0       
##  Mean   :4.1e-05   Mean   :0.08846   Mean   :0.004493   Mean   :0       
##  3rd Qu.:0.0e+00   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0       
##  Max.   :1.0e+00   Max.   :1.00000   Max.   :1.000000   Max.   :0       
##                                                                         
##  FLAG_DOCUMENT_11   FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14
##  Min.   :0.000000   Min.   :0        Min.   :0        Min.   :0       
##  1st Qu.:0.000000   1st Qu.:0        1st Qu.:0        1st Qu.:0       
##  Median :0.000000   Median :0        Median :0        Median :0       
##  Mean   :0.001169   Mean   :0        Mean   :0        Mean   :0       
##  3rd Qu.:0.000000   3rd Qu.:0        3rd Qu.:0        3rd Qu.:0       
##  Max.   :1.000000   Max.   :0        Max.   :0        Max.   :0       
##                                                                       
##  FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18  
##  Min.   :0        Min.   :0        Min.   :0        Min.   :0.000000  
##  1st Qu.:0        1st Qu.:0        1st Qu.:0        1st Qu.:0.000000  
##  Median :0        Median :0        Median :0        Median :0.000000  
##  Mean   :0        Mean   :0        Mean   :0        Mean   :0.001559  
##  3rd Qu.:0        3rd Qu.:0        3rd Qu.:0        3rd Qu.:0.000000  
##  Max.   :0        Max.   :0        Max.   :0        Max.   :1.000000  
##                                                                       
##  FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR
##  Min.   :0        Min.   :0        Min.   :0        Min.   :0.000             
##  1st Qu.:0        1st Qu.:0        1st Qu.:0        1st Qu.:0.000             
##  Median :0        Median :0        Median :0        Median :0.000             
##  Mean   :0        Mean   :0        Mean   :0        Mean   :0.002             
##  3rd Qu.:0        3rd Qu.:0        3rd Qu.:0        3rd Qu.:0.000             
##  Max.   :0        Max.   :0        Max.   :0        Max.   :2.000             
##                                                     NA's   :6049              
##  AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON
##  Min.   :0.000             Min.   :0.000              Min.   :0.000            
##  1st Qu.:0.000             1st Qu.:0.000              1st Qu.:0.000            
##  Median :0.000             Median :0.000              Median :0.000            
##  Mean   :0.002             Mean   :0.003              Mean   :0.009            
##  3rd Qu.:0.000             3rd Qu.:0.000              3rd Qu.:0.000            
##  Max.   :2.000             Max.   :2.000              Max.   :6.000            
##  NA's   :6049              NA's   :6049               NA's   :6049             
##  AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
##  Min.   :0.000             Min.   : 0.000            
##  1st Qu.:0.000             1st Qu.: 0.000            
##  Median :0.000             Median : 2.000            
##  Mean   :0.547             Mean   : 1.984            
##  3rd Qu.:1.000             3rd Qu.: 3.000            
##  Max.   :7.000             Max.   :17.000            
##  NA's   :6049              NA's   :6049

2.3.1 Notes

Most of the data appears to be in good condition. A few items are worth noting. The number of females in the data set is about twice the number of males. The number of people who don’t own a car is about twice the number who do, and the number who own a home is about twice the number who don’t. The number of children has a maximum of 19, which is possible but very high. The DAYS_EMPLOYED column appears to contain some incorrect data: its highest value is over 365,000 days, which is roughly 1,000 years. No one has been employed that long, so this is most likely a placeholder or data-entry error rather than a real value. It also distorts the mean: in the app_test summary above, the mean of DAYS_EMPLOYED is 67485 while the median is -1293.
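One possible way to handle the DAYS_EMPLOYED issue is sketched below. It is only an illustration: `toy` is a made-up stand-in for `app_train`, the flag column `DAYS_EMPLOYED_ANOM` is a hypothetical name, and treating 365243 as a placeholder is an assumption based on the summary output, not something the data dictionary confirms here.

```r
# Toy stand-in for app_train; the values are made up for illustration.
toy <- data.frame(DAYS_EMPLOYED = c(-1200, -350, 365243, -4100, 365243))

# Locate the rows that carry the implausible value seen in the summary.
idx <- which(toy$DAYS_EMPLOYED == 365243)
n_flagged <- length(idx)

# Keep a flag so the information is not lost, then blank out the value.
toy$DAYS_EMPLOYED_ANOM <- FALSE
toy$DAYS_EMPLOYED_ANOM[idx] <- TRUE
toy$DAYS_EMPLOYED[idx] <- NA

# The mean is now computed only from the plausible values.
mean_clean <- mean(toy$DAYS_EMPLOYED, na.rm = TRUE)
```

Keeping the flag column means a model could still learn from the fact that the value was missing, if that turns out to be informative.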

3 Feature engineering

3.1 Remove IDs

#Remove the ID column
app_train <- app_train[ -c(1) ]

Because IDs carry no information about someone’s credit default risk, we remove the ID column.
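Dropping the column by position works, but if the chunk is run twice it removes a second column. A minimal sketch of dropping by name instead is shown below. The `toy` data frame and the `drop_id()` helper are hypothetical; the column name `SK_ID_CURR` is taken from the summary output.

```r
# Toy stand-in for app_train; the values are made up for illustration.
toy <- data.frame(
  SK_ID_CURR = c(100001, 100005, 100013),
  TARGET     = c(0, 1, 0),
  AMT_CREDIT = c(568800, 222768, 663264)
)

# Hypothetical helper: drop the ID column by name instead of by position.
drop_id <- function(df, id_col = "SK_ID_CURR") {
  df[, setdiff(names(df), id_col), drop = FALSE]
}

toy <- drop_id(toy)
toy <- drop_id(toy)  # running it a second time changes nothing

remaining_cols <- names(toy)
```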

3.2 Normalize income

# Normalize the income observation
app_train$Sqrt_Income <- app_train$AMT_INCOME_TOTAL %>% sqrt()

3.3 Expand Region Population Relative

# Expand the Region Population Relative
app_train$RPR_Squared <- (app_train$REGION_POPULATION_RELATIVE)^2

3.4 Daily Income

#Add a new column to track daily income
app_train$Daily_Income <- (app_train$AMT_INCOME_TOTAL)/365

The columns added above serve three purposes: the square root compresses very large incomes so that values sit on a more similar scale, squaring the region population value widens the differences between regions, and the daily figure expresses each individual’s income per day.
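A quick way to check what these transformations do is sketched below on made-up numbers. The `toy` data frame and the `skew()` helper are hypothetical (the e1071 package loaded earlier also provides a `skewness()` function that could be used instead). The sketch compares skewness before and after the square root, and it also shows that a column obtained by dividing by a constant is perfectly correlated with the original, so `Daily_Income` may add little beyond `AMT_INCOME_TOTAL`.

```r
# Toy stand-in for app_train; the incomes are made up for illustration.
toy <- data.frame(AMT_INCOME_TOTAL = c(27000, 112500, 157500, 225000, 4410000))

toy$Sqrt_Income  <- sqrt(toy$AMT_INCOME_TOTAL)
toy$Daily_Income <- toy$AMT_INCOME_TOTAL / 365

# Hypothetical helper: simple moment-based skewness, base R only.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw  <- skew(toy$AMT_INCOME_TOTAL)  # right-skewed because of one large income
skew_sqrt <- skew(toy$Sqrt_Income)       # somewhat less skewed after the square root

# Dividing by a constant does not change the correlation structure.
cor_daily <- cor(toy$AMT_INCOME_TOTAL, toy$Daily_Income)
```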

4 Correlation

#Build a correlation matrix comparing several variables. Uncomment for an output.

# app_train %>% select(TARGET, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_BIRTH, AMT_GOODS_PRICE, AMT_INCOME_TOTAL, AMT_CREDIT, NAME_INCOME_TYPE, OCCUPATION_TYPE, HOUR_APPR_PROCESS_START, WEEKDAY_APPR_PROCESS_START) %>% pairs.panels()

app_train %>% select(TARGET, Daily_Income, RPR_Squared, Sqrt_Income) %>% pairs.panels()

Based on the correlation matrix, we see that there are no strong correlations between the target variable and the others. Finding strong predictors of the target variable will require additional work.
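To scan more variables than fit comfortably in a panel plot, the correlations with the target can also be computed as plain numbers. The sketch below is a rough illustration on made-up data. It assumes `TARGET` is stored as a numeric 0/1 column (a factor would need converting first), and `toy` is a hypothetical stand-in for `app_train`.

```r
# Toy stand-in for app_train; the values are made up for illustration.
toy <- data.frame(
  TARGET     = c(0, 0, 1, 0, 1, 0, 0, 1),
  DAYS_BIRTH = c(-20000, -15000, -9000, -18000, -10000, -16000, -21000, -11000),
  AMT_CREDIT = c(450000, 270000, 500000, 675000, 300000, 260000, 520000, 410000)
)

# Keep numeric columns only, then take the TARGET column of the correlation matrix.
num_cols <- sapply(toy, is.numeric)
cors <- cor(toy[, num_cols], use = "pairwise.complete.obs")[, "TARGET"]

# Rank the other variables by absolute correlation with TARGET.
cors_sorted <- sort(abs(cors[names(cors) != "TARGET"]), decreasing = TRUE)
```

The `use = "pairwise.complete.obs"` argument matters for this data set, because many columns contain missing values.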

#Plots ## Scatterplots

#Build a scatter plot of AMT_CREDIT against AMT_INCOME_TOTAL, colored by TARGET
  ggplot(data = app_train, mapping = aes(x = AMT_CREDIT, y = AMT_INCOME_TOTAL, colour = TARGET)) +
     geom_point() +
     labs(title = "AMT_CREDIT AND AMT_INCOME_TOTAL broken down by TARGET")

This scatter plot shows that AMT_CREDIT and AMT_INCOME_TOTAL together do not separate the two target classes well. Most incomes fall within the same narrow range.

4.1 Barplot

#Build a bar plot showing the Target variables broken down by occupation type
  ggplot(data = app_train, mapping = aes(x = OCCUPATION_TYPE)) +
     geom_bar() +
     facet_wrap(facets = ~TARGET, ncol = 1) +
     labs(title = "Bar plot of Occupation Type by Target")

This chart provides some interesting insights. Laborers, drivers, and sales staff have the highest number of applicants with difficulty paying back loans. Those same three occupations also have some of the highest numbers of applicants who do pay back their loans. This is because a large share of the individuals in this data set fall into those categories, so the raw counts mostly reflect group size rather than risk.
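Because the bar plot compares raw counts, a default rate per category may be easier to interpret. The sketch below is a minimal illustration on made-up data. It assumes `TARGET` is numeric 0/1, and `toy` is a hypothetical stand-in for `app_train`.

```r
# Toy stand-in for app_train; the values are made up for illustration.
toy <- data.frame(
  OCCUPATION_TYPE = c("Laborers", "Laborers", "Laborers", "Laborers",
                      "Drivers", "Drivers",
                      "Managers", "Managers", "Managers", "Managers"),
  TARGET = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 1)
)

# Number of defaults per occupation (what the bar plot effectively shows).
default_counts <- tapply(toy$TARGET, toy$OCCUPATION_TYPE, sum)

# Share of defaults per occupation (the mean of a 0/1 column is a rate).
default_rates <- tapply(toy$TARGET, toy$OCCUPATION_TYPE, mean)

# Occupations ordered from highest to lowest default rate.
rates_sorted <- sort(default_rates, decreasing = TRUE)
```

In this toy example every occupation has the same number of defaults, yet the rates differ because the groups differ in size. The same calculation could be repeated for NAME_INCOME_TYPE and NAME_EDUCATION_TYPE.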

#Build a bar plot showing the Target variables broken down by income type
  ggplot(data = app_train, mapping = aes(x = NAME_INCOME_TYPE)) +
     geom_bar() +
     facet_wrap(facets = ~TARGET, ncol = 1) +
     labs(title = "Bar plot of Income Type by Target")

This chart provides insights similar to the one above. Working and Commercial associate have the highest number of applicants with payment difficulties. They also have the highest number of applicants who pay back their loans. We see this pattern throughout the data.

#Build a bar plot showing the Target variables broken down by education type
  ggplot(data = app_train, mapping = aes(x = NAME_EDUCATION_TYPE)) +
     geom_bar() +
     facet_wrap(facets = ~TARGET, ncol = 1) +
     labs(title = "Bar plot of Education Type by Target")

This chart provides some interesting insights. Secondary / secondary special has the highest number of applicants with payment difficulties. It also has the highest number of applicants who pay back their loans. We see this pattern throughout the data.

4.2 Boxplot

#Build a boxplot of the Target variable compared to the AMT_INCOME_TOTAL
  ggplot(data = app_train, mapping = aes(x = TARGET, y = AMT_INCOME_TOTAL)) + 
    geom_boxplot() +
     labs(title = "Boxplot of Income by Target")

This shows that there is a large outlier in the group where TARGET is 1. We will want to remove that outlier and rerun the plot to get a more readable result.
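One possible way to remove the outlier before rerunning the plot is to cap income at an upper quantile. The sketch below uses a made-up data frame (`toy`, hypothetical values) and assumes TARGET is a factor; the 99th percentile cutoff and the names `income_cap` and `toy_trimmed` are illustrative choices, not something the notebook prescribes.

```r
library(ggplot2)

# Toy stand-in for app_train; the values are made up for illustration only.
toy <- data.frame(
  TARGET = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  AMT_INCOME_TOTAL = c(90000, 135000, 180000, 225000,
                       112500, 157500, 202500, 9000000)
)

# Keep only rows at or below the chosen upper quantile of income.
income_cap <- quantile(toy$AMT_INCOME_TOTAL, probs = 0.99, na.rm = TRUE)
toy_trimmed <- toy[toy$AMT_INCOME_TOTAL <= income_cap, ]

# Rerun the boxplot on the trimmed data.
p <- ggplot(data = toy_trimmed, mapping = aes(x = TARGET, y = AMT_INCOME_TOTAL)) +
  geom_boxplot() +
  labs(title = "Boxplot of Income by Target (extreme incomes removed)")
print(p)
```

An alternative that keeps every row is to add `scale_y_log10()` to the original plot, which compresses the extreme value instead of dropping it.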

5 Conclusion

This data set is extremely large and complex. It has been difficult to find patterns in the characteristics that distinguish those who have payment difficulties from those who do not. The data seems relatively clean, though there are some outliers and questionable observations. More complex models and feature engineering will be needed to gather deeper insights.

6 Models

6.1 Decision Trees

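#Load these packages first if they are not already attached
#library(C50); library(rpart); library(rpart.plot)

#train_tree <- C5.0(formula = TARGET ~ ., data = app_train)
#train_tree$size

#plot(train_tree)

#train_tree <- rpart(TARGET ~ ., data = app_train, method = 'class')
#rpart.plot(train_tree)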

The tree model code is commented out but provided for future use in building models.

6.2 Logistic Regression

# Simple Logistic Regression Model
default_model <- app_train %>% glm(formula = TARGET ~ DAYS_REGISTRATION + DAYS_BIRTH + WEEKDAY_APPR_PROCESS_START + HOUR_APPR_PROCESS_START, family = "binomial")

#Show the logistic regression model
summary(default_model)
## 
## Call:
## glm(formula = TARGET ~ DAYS_REGISTRATION + DAYS_BIRTH + WEEKDAY_APPR_PROCESS_START + 
##     HOUR_APPR_PROCESS_START, family = "binomial", data = .)
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -8.974e-01  3.888e-02 -23.080   <2e-16 ***
## DAYS_REGISTRATION                    1.979e-05  2.118e-06   9.346   <2e-16 ***
## DAYS_BIRTH                           6.528e-05  1.665e-06  39.209   <2e-16 ***
## WEEKDAY_APPR_PROCESS_STARTMONDAY    -5.976e-02  2.335e-02  -2.559   0.0105 *  
## WEEKDAY_APPR_PROCESS_STARTSATURDAY  -6.531e-02  2.604e-02  -2.508   0.0121 *  
## WEEKDAY_APPR_PROCESS_STARTSUNDAY    -8.411e-02  3.349e-02  -2.512   0.0120 *  
## WEEKDAY_APPR_PROCESS_STARTTHURSDAY  -7.859e-03  2.313e-02  -0.340   0.7340    
## WEEKDAY_APPR_PROCESS_STARTTUESDAY    2.854e-02  2.263e-02   1.261   0.2071    
## WEEKDAY_APPR_PROCESS_STARTWEDNESDAY  5.090e-04  2.294e-02   0.022   0.9823    
## HOUR_APPR_PROCESS_START             -3.448e-02  2.031e-03 -16.972   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 172542  on 307510  degrees of freedom
## Residual deviance: 170212  on 307501  degrees of freedom
## AIC: 170232
## 
## Number of Fisher Scoring iterations: 5

This is a simple logistic regression model that uses only a few variables. We would want to build a more advanced model to gather further insights and see whether the p-values stay consistent.
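The coefficients in the summary are on the log-odds scale, so exponentiating them can make their size easier to read. The sketch below copies three estimates from the output above and converts them to odds ratios. Rescaling the day-based variables to a 365-day change is an illustrative choice, and `coefs`, `odds_ratios`, and `odds_ratio_per_year` are names introduced here.

```r
# Estimates copied from the summary(default_model) output (log-odds scale).
coefs <- c(
  DAYS_REGISTRATION       = 1.979e-05,
  DAYS_BIRTH              = 6.528e-05,
  HOUR_APPR_PROCESS_START = -3.448e-02
)

# Odds ratio for a one-unit increase in each variable.
odds_ratios <- exp(coefs)

# A one-day change is tiny, so rescale the day-based variables to 365 days.
odds_ratio_per_year <- exp(coefs[c("DAYS_REGISTRATION", "DAYS_BIRTH")] * 365)

print(round(odds_ratios, 4))
print(round(odds_ratio_per_year, 4))

# On the fitted model itself, the equivalent call would be exp(coef(default_model)).
```

Assuming DAYS_BIRTH is still stored as negative days relative to the application date, as in the raw Kaggle data, the per-year value for DAYS_BIRTH works out to roughly 1.02, meaning the odds of default are about 2% higher for each year younger an applicant is, holding the other variables fixed.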